Study of the Relationship of Training Set Size to Error Rate in Yet Another Decision Tree and Random Forest Algorithms
Abstract
ACKNOWLEDGMENTS

I thank Almighty God for providing the strength and knowledge to pursue this research work. I would like to express my sincere gratitude to Dr. Susan Mengel, the chairperson of my committee, for guiding me through the research. In spite of her busy schedule, she was very helpful in solving problems and presented new insights. It was a pleasure to work alongside her, which has helped shape my academic objectives. I am glad to have Dr. Yu Zhuang as my committee member. He has provided me immense moral support during the entire course of my research. I would like to thank him for his full cooperation and support toward the successful completion of my thesis research. I would like to thank Ms. Colette Solpietro and the other members of the Office of Research Services for their support and confidence in me. Finally, I would like to thank the most important people in my life, my parents and my brother. I am here because of their immeasurable confidence in me and the sacrifices they made for me. I hope this work stands up to the high standards you have always expected of me.

ABSTRACT

Classification algorithms are among the most widely used data mining techniques for prediction. Among their different types, the decision tree is a predictive classification model with significant advantages over other techniques: it is easy to interpret, quick to construct, highly accurate, and uses fewer resources. Decision tree models can be built with algorithms such as C4.5, CART, YaDT, and Random Forest, whose performance is measured by error rate. This thesis studies the relationship of training data size to error rate for the YaDT and Random Forest algorithms and compares the performance of both with the results of C4.5 and CART.

The research supports several conclusions. For example, the well-accepted 66.7:33.3 splitting ratio in the literature can be increased to 80:20 for large data sets with more than 1000 samples to generate more accurate decision tree models. The stability of all algorithms in the study is weak beyond the 90:10 ratio because very little testing data remains. While YaDT performs similarly to C4.5 and CART, Random Forest performs significantly better than the other three. Model performance is best determined with large data sets.
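The kind of experiment the abstract describes, varying the train:test splitting ratio and measuring test error, can be illustrated with a minimal sketch. This is not the thesis's procedure: it uses a one-level decision tree (a stump) on synthetic data instead of YaDT or Random Forest, and the names split_by_ratio, fit_stump, error_rate, and make_data are hypothetical helpers introduced only for illustration.

```python
import random


def split_by_ratio(samples, train_fraction, seed=0):
    """Shuffle a copy of samples and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_stump(train):
    """Fit a one-level decision tree on (x, label) pairs with labels 0/1.

    Returns the threshold t with the fewest training errors under the
    rule "predict 1 when x > t". Assumption: a stand-in for a real
    decision tree learner, chosen to keep the sketch self-contained.
    """
    best_t, best_errors = None, None
    for t in sorted({x for x, _ in train}):
        errors = sum((x > t) != bool(y) for x, y in train)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def error_rate(threshold, test):
    """Fraction of test pairs misclassified by the stump."""
    wrong = sum((x > threshold) != bool(y) for x, y in test)
    return wrong / len(test)


def make_data(n, noise=0.1, seed=1):
    """Synthetic data (an assumption): label is 1 when x > 0.5, flipped with probability noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


if __name__ == "__main__":
    data = make_data(1200)
    # Ratios mentioned in the abstract: 66.7:33.3, 80:20, and 90:10.
    for fraction in (0.667, 0.8, 0.9):
        train, test = split_by_ratio(data, fraction)
        threshold = fit_stump(train)
        print(f"train fraction {fraction:.3f}: test error {error_rate(threshold, test):.3f}")
```

Repeating the loop over several seeds would show how much the test error fluctuates at each ratio, which is one way to probe the stability concern the abstract raises for very small test sets.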
Similar resources
Application of data mining algorithms in discriminating sediment sources in the Nodeh watershed, Gonabad
Introduction: Reduction of sediment supply requires the implementation of soil conservation and sediment control programs in the form of watershed management plans. Sediment control programs require identifying the relative importance of sediment sources, their quantitative ascription and identification of critical areas within the watersheds. The sediment source ascription involves two...
Application of ensemble learning techniques to model the atmospheric concentration of SO2
In view of pollution prediction modeling, the study adopts homogeneous (random forest, bagging, and additive regression) and heterogeneous (voting) ensemble classifiers to predict the atmospheric concentration of sulphur dioxide. For model validation, results were compared against widely known single base classifiers such as support vector machine, multilayer perceptron, linear regression and re...
Personal Credit Score Prediction using Data Mining Algorithms (Case Study: Bank Customers)
Knowledge and information extraction from data is an age-old concept in scientific studies. In industrial decision-making processes, the application of this concept gives rise to data-mining opportunities. Personal credit scoring is an ever-vital tool for banking systems in order to manage and minimize the inherent risks of the financial sector, thus, the design and improvement of credit scorin...
Decision tree studies in estimating breast cancer risk using single nucleotide polymorphisms
Abstract Introduction: Decision trees are data mining tools for collecting, accurately predicting, and sifting information from massive amounts of data, and they are widely used in computational biology and bioinformatics. In bioinformatics, they can be used to predict diseases, including breast cancer. The use of genomic data, including single nucleotide polymorphisms, is a very important ...
Determining Factors Influencing Length of Stay and Predicting Length of Stay Using Data Mining in the General Surgery Department
Background: Length of stay is one of the most important indicators in assessing hospital performance. A shorter stay can reduce the costs per discharge and shift care from inpatient to less expensive post-acute settings. It can lead to a greater readmission rate, better resource management, and more efficient services. Objective: This study aimed to ident...
Publication date: 2006